Tools to handle trait macroecological datasets in R

The rmacroRDM package contains functions to help with the compilation of macroecological datasets. It compiles datasets into a master long database of individual observations, matched to a specified master species list. It also checks, separates and stores taxonomic and metadata information on the observations, variables and datasets contained in the data. It therefore aims to ensure full traceability of datapoints and as robust quality control, all the way through to the extracted analytical datasets.

For more details and context, see the rmacroRDM github repo

R code for this workflow available here



1 setup

1.1 source rmacroRDM functions

First source the rmacroRDM functions. Currently the best way is to just source from github using RCulr:getURL().

WARNING: Repo under continuous development

require(RCurl)
eval(parse(text = getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/R/functions.R", ssl.verifypeer = FALSE)))
eval(parse(text = getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/R/wideData_function.R", ssl.verifypeer = FALSE)))

1.2 setup file.system

Next, to initialise the project we need to supply valid pathways to project folders containing:

  • data: folder containg private input and output folders are stored, usually a googledrive folder) and
  • code: folder containing scripts associated with the project. This is usually an RStudio project directory, ideally version controlled on github.

The function sets those directories in the global environment.

setDirectories(script.folder = "~/Documents/workflows/Brain_size_evolution/", 
               data.folder = "~/Google Drive/Brain evolution/",
               envir = globalenv())

Once project folders have been set, we set up the file system by creating the required folders (if they don’t exist already) in the project folders.

setupFileSystem(script.folder = "~/Documents/workflows/Brain_size_evolution/", 
                data.folder = "~/Google Drive/Brain evolution/")

1.3 initialise database configurations and attach to search pathway

In this step, we initialise the environment with some required parameters to build the database and process files to it. The call below shows the default initialisation settings which you would get if you just called inti_db()

init_db(var.vars = c("var", "value", "data.ID"),
                    match.vars = c("synonyms", "data.status"),
                    meta.vars = c("qc", "observer", "ref", "n", "notes"),
                    taxo.vars = c("genus", "family", "order"),
                    spp.list_src = NULL)

I actually want to set “D0” as the file from which to extract the spp.list in a bit so I set spp.list_src = "D0".

init_db(spp.list_src = "D0")
configuring: 
master.vars
match.vars
meta.vars
spp.list_src
taxo.vars
var.vars
NULL

The function appends the given arguments to environment master_config at position 2 in the search path (note position of GlobalEnvironment = 1).

Here’s a list of the values of the objects we just attached as configurations:

1.4 setup input.folder

Next we setup the folders in input.folder/pre/ and post/ according to the configurations set in the previous step.

If the correct setup already exists, no action is taken:

setupInputFolder(input.folder)
dirs <- list.dirs(paste(input.folder, "pre/", sep = ""), full.names = T)

print(dirs)
[1] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre/"         
[2] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//csv"     
[3] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//n"       
[4] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//notes"   
[5] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//observer"
[6] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//qc"      
[7] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//ref"     



2 populate file.system

The functions take advantage of the structure of the file.sytem to automate loading and linking of data and metadata through appropriate naming and location of files within the file.system.

  • raw/ organise all raw data
  • pre/ save copies of the raw data files in the appropriate data (cvs) or meta.vars folders.

NB meta.var data sheets should be named with the same name as the data data sheet. Take care during this stage to ensure files are named correctly and stored in the appropriate folders.

3 load required data

3.1 make fcodes (folder codes) vector

The fcodes vector specifies the details of folders in pre/ and post/ input.folder folders. It also creates appropriate code prefixes for each type of data or meta.var sheet. Note that “D” is reserved for data files, “R” for ref files and “N” for n files.

fcodes <- ensure_fcodes(meta.vars)
print(fcodes)
         D          R          N          Q          O         NO 
     "csv"      "ref"        "n"       "qc" "observer"    "notes" 

3.2 set file.names vector

Specify the file.names of the files you wish to process. If you want to process all files in the file.system use file.names = NULL.

file.names <- create_file.names(file.names = c("brainmain2.csv", 
                                  "Amniote_Database_Aug_2015.csv", "anagedatasetf.csv"))
dcodes succesfully extracted from 'data_log.csv'
print(file.names)
                             x0                              x1 
               "brainmain2.csv" "Amniote_Database_Aug_2015.csv" 
                             x2 
            "anagedatasetf.csv" 

3.3 load system reference files

Load the system reference (sys.ref) files required for data processing.

load_sys.ref(fileEncoding = "mac", view = F)
loading: 
metadata
data_log
vnames
attaching to env: sys.ref
NULL

*use view = T to open a viewer for each of the sys.ref files on load.

3.3.1 metadata.csv

kable(head(metadata, 10))
code cat descr scores levels type units
species NOMINAL Scientific name NA NA NA NA
brain.volume MORPHOLOGICAL Brain volume NA NA CON mm
brain.mass MORPHOLOGICAL Brain mass NA NA CON g
male.brain.mass MORPHOLOGICAL Male brain mass NA NA CON g
female.brain.mass MORPHOLOGICAL Female brain mass NA NA CON g
body.mass MORPHOLOGICAL Body mass NA NA CON g
male.body.mass MORPHOLOGICAL Male body mass NA NA CON g
female.body.mass MORPHOLOGICAL Female body mass NA NA CON g
telencephalic.volume.fraction MORPHOLOGICAL Telencephalic volume fraction NA NA CON mm
female.maturity LIFE HISTORY TRAIT Female maturity NA NA INT d

3.3.2 data_log.csv

kable(data_log)
dcode file.name descr source source.contact method notes
D0 brainmain2.csv NA NA NA NA NA
D1 Amniote_Database_Aug_2015.csv NA NA NA NA NA
D2 anagedatasetf.csv NA NA NA NA NA
D3 lifehistraits.csv NA NA NA NA NA
D4 parentalcare.csv NA NA NA NA NA

3.3.3 vnames.csv

kable(head(vnames, 10))
code D0 D1 D2 D3 D4 R1 N1
species Scientific name species Scientific.name Scientific name Scientific_name species species
genus NA genus NA NA NA genus genus
brain.volume_n n NA Sample.size NA NA NA NA
brain.volume Brain Vol. NA NA NA NA NA NA
brain.volume_ref Source brain vol. NA NA NA NA NA NA
brain.mass Brain mass (g) NA NA NA NA NA NA
brain.mass_n n brain mass NA NA NA NA NA NA
brain.mass_ref source brain mass NA NA NA NA NA NA
male.brain.mass male brain mass NA NA NA NA NA NA
male.brain.mass_n male brain mass no NA NA NA NA NA NA

3.4 load syn.links.csv

Used for taxonomic matching (plans to automate this by integrating package taxize. supplied syn.links only relates to birds).

syn.links <- read.csv(text=getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/data/input/taxo/syn.links.csv", 
                                  ssl.verifypeer = FALSE), header=T)



4 process file.system

This step processes the .csv copies of the raw data in the pre/ folder and writes the processed files as .csv to the post/ folder, ready to be matched. The processing conserves the file.names throughout. It is important for this step that the file.system is correctly populated (ie. meta.var data sheets should be named with the same name as the data data sheet and in the correct folder).


The function runs a basic processing stage for each file in file.names:

  • processing is iterated over:
    • all file.names available in the file.system if file.system = "fromFS",
    • files specified in file.names if file.names is vector of file names (note that only files available in the file.system are processed).
  • data are loaded, trimmed of whitespace, blank lines and c("", " ", "NA", "-999") coded as NAs by default.
  • column names is the data are matched to master variable codes through vnames
  • if the original dataset contains species details across two columns (ie species and genus), data in the columns are concatinated in the form "genus_species" and merged into a single species column. ensure column in files containing genus data is matched to code genus in vnames.


custom processing scripts

you can include extra processing scripts for individual files by adding them into the {script.folder}process/ folder. To be loaded correctly, scripts need to be named appropriately:

  • to source across all files in file.system matching the file.name: name script as file.name. eg ."Amniote_Database_Aug_2015.R"
  • to source for files in a specific folder in file.system matching the file.name: name script as file.name appended with appropriate fcode, eg ."Amniote_Database_Aug_2015_ref.R"
process_file.system(file.names, fcodes)
[1] "all files in file system have valid vnames columns"
===================================================================
processing Amniote_Database_Aug_2015.csv D1 
[1] "source complete"
sourced: process/Amniote_Database_Aug_2015.R
'genus_species' combined in species column. Genus data removed 
df nrow (species) : 9802 df ncol (traits) : 20 
*** 
===================================================================
processing anagedatasetf.csv D2 
df nrow (species) : 1191 df ncol (traits) : 18 
*** 
===================================================================
processing brainmain2.csv D0 
duplicate species name in D0
df nrow (species) : 2586 df ncol (traits) : 24 
*** 
===================================================================
processing Amniote_Database_Aug_2015.csv R1 
[1] "source complete"
sourced: process/Amniote_Database_Aug_2015.R
'genus_species' combined in species column. Genus data removed 
df nrow (species) : 9802 df ncol (traits) : 20 
*** 
===================================================================
processing Amniote_Database_Aug_2015.csv N1 
[1] "source complete"
sourced: process/Amniote_Database_Aug_2015.R
'genus_species' combined in species column. Genus data removed 
df nrow (species) : 9802 df ncol (traits) : 19 
*** 



5 Create database

5.1 create spp.list

Use spp.list_source to specify dcode of file.name to extract spp.list from. Otherwise, supply vector of species names to species.

spp.list <- createSpp.list(species = NULL, 
                           taxo.dat = NULL, 
                           spp.list_src = spp.list_src)
species list extracted from dataset: D0
 file.name: brainmain2.csv
str(spp.list, vec.len = 3)
'data.frame':   1906 obs. of  4 variables:
 $ species    : chr  "Aix_galericulata" "Aix_sponsa" "Alopochen_aegyptiaca" ...
 $ master.spp : logi  TRUE TRUE TRUE TRUE ...
 $ rel.spp    : logi  NA NA NA NA ...
 $ taxo.status: chr  "original" "original" "original" ...
 - attr(*, "type")= chr "spp.list"

5.2 create master shell

master <- create_master(spp.list)
str(master, max.level = 2, vec.len = 3)
List of 3
 $ data    :'data.frame':   0 obs. of  11 variables:
  ..$ species    : logi(0) 
  ..$ synonyms   : logi(0) 
  ..$ data.status: logi(0) 
  ..$ var        : logi(0) 
  ..$ value      : logi(0) 
  ..$ data.ID    : logi(0) 
  ..$ qc         : logi(0) 
  ..$ observer   : logi(0) 
  ..$ ref        : logi(0) 
  ..$ n          : logi(0) 
  ..$ notes      : logi(0) 
  ..- attr(*, "format")= chr "master"
 $ spp.list:'data.frame':   1906 obs. of  4 variables:
  ..$ species    : chr [1:1906] "Aix_galericulata" "Aix_sponsa" "Alopochen_aegyptiaca" ...
  ..$ master.spp : logi [1:1906] TRUE TRUE TRUE TRUE ...
  ..$ rel.spp    : logi [1:1906] NA NA NA NA ...
  ..$ taxo.status: chr [1:1906] "original" "original" "original" ...
  ..- attr(*, "type")= chr "spp.list"
 $ metadata:'data.frame':   38 obs. of  7 variables:
  ..$ code  : chr [1:38] "species" "brain.volume" "brain.mass" ...
  ..$ cat   : chr [1:38] "NOMINAL" "MORPHOLOGICAL" "MORPHOLOGICAL" ...
  ..$ descr : chr [1:38] "Scientific name" "Brain volume" "Brain mass" ...
  ..$ scores: chr [1:38] NA NA NA ...
  ..$ levels: chr [1:38] NA NA NA ...
  ..$ type  : chr [1:38] NA "CON" "CON" ...
  ..$ units : chr [1:38] NA "mm" "g" ...
 - attr(*, "type")= chr "master"
 - attr(*, "file.names")= chr "empty"



6 match new datasets to master

6.1 create match object from file.name.

The fuction loads the file specified by filename in input.folder/pre/csv/. The argument sub specifies which of the two sets of species to be matched (spp.list or data) is a subset (ie smaller) than the other. spp.list is the spp.list attached to the master.

filename <- file.names[file.names == "Amniote_Database_Aug_2015.csv"]

m <- matchObj(file.name = filename,
              spp.list = master$spp.list,
              sub = "spp.list") # use addMeta function to manually add metadata.
List of 7
 $ data.ID  : chr "D1"
 $ data     :'data.frame':  9802 obs. of  20 variables:
  ..- attr(*, "format")= chr "data:wide"
 $ spp.list :'data.frame':  1906 obs. of  4 variables:
  ..- attr(*, "type")= chr "spp.list"
 $ sub      : chr "spp.list"
 $ set      : chr "data"
 $ meta     :List of 5
  ..- attr(*, "type")= chr "meta"
 $ file.name: Named chr "Amniote_Database_Aug_2015.csv"
  ..- attr(*, "names")= chr "x1"
 - attr(*, "type")= chr "match.object"
 - attr(*, "status")= chr "unmatched"

6.2 compile metadata and prepare m for matching

m <- m %>% 
  separateDatMeta() %>% 
  compileMeta(input.folder = input.folder) %>%
  checkVarMeta(master$metadata) %>%
  dataMatchPrep()
[1] "============================================================================"
[1] "processing meta.var: qc"
[1] "Warning: NULL data for meta.var: qc"
[1] "============================================================================"
[1] "processing meta.var: observer"
[1] "Warning: NULL data for meta.var: observer"
[1] "============================================================================"
[1] "processing meta.var: ref"
[1] "loading 'ref/Amniote_Database_Aug_2015.csv'"
[1] "============================================================================"
[1] "processing meta.var: n"
[1] "loading 'n/Amniote_Database_Aug_2015.csv'"
[1] "n vars matched successfully to post/n/Amniote_Database_Aug_2015_n_group.csv"
[1] "============================================================================"
[1] "processing meta.var: notes"
[1] "Warning: NULL data for meta.var: notes"
[1] "D1 metadata complete"
List of 7
 $ data.ID  : chr "D1"
 $ data     :'data.frame':  9802 obs. of  22 variables:
 $ spp.list :'data.frame':  1906 obs. of  4 variables:
  ..- attr(*, "type")= chr "spp.list"
 $ sub      : chr "spp.list"
 $ set      : chr "data"
 $ meta     :List of 5
  ..- attr(*, "type")= chr "meta"
 $ file.name: Named chr "Amniote_Database_Aug_2015.csv"
  ..- attr(*, "names")= chr "x1"
 - attr(*, "type")= chr "match.object"
 - attr(*, "status")= chr "unmatched"

6.3 match m to spp.list

m <- dataSppMatch(m, syn.links = syn.links, addSpp = T)
Warning in dataSppMatch(m, syn.links = syn.links, addSpp = T): match
incomplete, 8 spp.list datapoints unmatched
List of 8
 $ data.ID  : chr "D1"
 $ data     :'data.frame':  1898 obs. of  22 variables:
  ..- attr(*, "format")= chr "data:wide"
 $ spp.list :'data.frame':  1906 obs. of  4 variables:
  ..- attr(*, "type")= chr "spp.list"
 $ sub      : chr "spp.list"
 $ set      : chr "data"
 $ meta     :List of 5
  ..- attr(*, "type")= chr "meta"
 $ file.name: Named chr "Amniote_Database_Aug_2015.csv"
  ..- attr(*, "names")= chr "x1"
 $ unmatched:'data.frame':  8 obs. of  2 variables:
 - attr(*, "type")= chr "match.object"
 - attr(*, "status")= chr "matched:incomplete - 8 spp.list unmatched"

6.3.1

output <- masterDataFormat(m, meta.vars, match.vars, var.vars)
Warning in masterDataFormat(m, meta.vars, match.vars, var.vars): 1575 data points missing reference information!:
 (9.6% of 16469)
str(output, max.level = 1, vec.len = 3)
List of 3
 $ data     :'data.frame':  16469 obs. of  11 variables:
  ..- attr(*, "format")= chr "master"
 $ spp.list :'data.frame':  1906 obs. of  4 variables:
  ..- attr(*, "type")= chr "spp.list"
 $ file.name: Named chr "Amniote_Database_Aug_2015.csv"
  ..- attr(*, "names")= chr "D1"
 - attr(*, "type")= chr "data:master"
 - attr(*, "status")= chr "matched"

6.3.2 merge to master

master <- updateMaster(master, output = output)
str(master, max.level = 1, vec.len = 3)
List of 3
 $ data    :'data.frame':   16469 obs. of  11 variables:
  ..- attr(*, "format")= chr "master"
 $ spp.list:'data.frame':   1906 obs. of  4 variables:
  ..- attr(*, "type")= chr "spp.list"
 $ metadata:'data.frame':   38 obs. of  7 variables:
 - attr(*, "type")= chr "master"
 - attr(*, "file.names")= Named chr "Amniote_Database_Aug_2015.csv"
  ..- attr(*, "names")= chr "D1"


7 interactive view of framework objects:

7.1 master

jsonedit(master)


7.2 matched m

jsonedit(m)
---
title: "rmacroRDM_workflow"
author: 
date: "last rendered: `r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_document:
    toc: true
    toc_float: true
    number_sections: true
    theme: paper
    
    
---


***

<br>

<font size="4.5"><b>Tools to handle trait macroecological datasets in R</b></font>


The rmacroRDM package contains functions to help with the compilation of
macroecological datasets. It compiles datasets into a **master long
database of individual observations**, matched to a specified **master
species list**. It also *checks, separates and stores taxonomic and
metadata information* on the *observations*, *variables* and *datasets*
contained in the data. It therefore aims to ensure full traceability of
datapoints and as robust quality control, all the way through to the
extracted analytical datasets.

**For more details and context, see the rmacroRDM github [repo](https://github.com/annakrystalli/rmacroRDM)**

R code for this workflow available [here](https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/utils/temp_vignette.R)

***
<br>

# setup

```{r global-setup, echo = F}
rm(list=ls())
options(stringsAsFactors = F)
```

```{r rmd-setup, echo=FALSE, purl=FALSE, warning=FALSE, message=FALSE}
require(RCurl)
require(knitr)
library(listviewer)
knitr::opts_chunk$set(echo = TRUE)
```


## source rmacroRDM functions

First source the rmacroRDM functions. Currently the best way is to just source from github using `RCulr:getURL()`.

***WARNING: Repo under continuous development***

```{r source-rmacroRDM, message=FALSE, warning=FALSE}
require(RCurl)
eval(parse(text = getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/R/functions.R", ssl.verifypeer = FALSE)))
eval(parse(text = getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/R/wideData_function.R", ssl.verifypeer = FALSE)))

```

## setup file.system

Next, to initialise the project we need to supply valid pathways to project folders containing:

- **data\:** folder containg private input and output folders are stored, usually a googledrive folder) and 
- **code:** folder containing scripts associated with the project. This is usually an RStudio project directory, ideally version controlled on github.

The function sets those directories in the global environment.

```{r set-dirs}

setDirectories(script.folder = "~/Documents/workflows/Brain_size_evolution/", 
               data.folder = "~/Google Drive/Brain evolution/",
               envir = globalenv())

```

Once project folders have been set, we set up the *file system* by creating the required folders (if they don't exist already) in the project folders. 

```{r setup-file.system}

setupFileSystem(script.folder = "~/Documents/workflows/Brain_size_evolution/", 
                data.folder = "~/Google Drive/Brain evolution/")
```

## initialise database configurations and attach to search pathway

In this step, we initialise the environment with some required parameters to build the database and process files to it. The call below shows the default initialisation settings which you would get if you just called `inti_db()`

```{r master-configuration-default, eval=F, purl=FALSE}
init_db(var.vars = c("var", "value", "data.ID"),
                    match.vars = c("synonyms", "data.status"),
                    meta.vars = c("qc", "observer", "ref", "n", "notes"),
                    taxo.vars = c("genus", "family", "order"),
                    spp.list_src = NULL)
```

I actually want to set **"D0"** as the file from which to extract the spp.list in a bit so I set `spp.list_src = "D0"`.

```{r master-configuration, eval=T}
init_db(spp.list_src = "D0")
```

The function appends the given arguments to environment `master_config` at position 2 in the search path (note position of `GlobalEnvironment` = 1). 

Here's a list of the values of the objects we just attached as configurations:
```{r print-master_config, echo=FALSE, purl=FALSE}
ms_conf <- setNames(lapply(ls("master_config"), FUN = get, envir = environment()), 
         ls("master_config"))
jsonedit(ms_conf)
```


## setup input.folder

Next we setup the folders in `input.folder/pre/` and `post/` according to the configurations set in the previous step.

If the correct setup already exists, no action is taken:

```{r setup-input.folder}
setupInputFolder(input.folder)
```


```{r print-ia-folder-str, purl=FALSE}
dirs <- list.dirs(paste(input.folder, "pre/", sep = ""), full.names = T)

print(dirs)
```

<br>

***

# populate file.system

The functions take advantage of the structure of the file.sytem to automate loading and linking of data and metadata through appropriate naming and location of files within the file.system.

- **`raw/`** organise all raw data
- **`pre/` save copies of the raw data files in the appropriate data (`cvs`) or `meta.vars` folders.**

***NB*** *`meta.var` data sheets should be named with the same name as the `data` data sheet. Take care during this stage to ensure files are named correctly and stored in the appropriate folders.* 

# load required data

## make fcodes (folder codes) vector

The `fcodes` vector specifies the details of folders in `pre/` and `post/` `input.folder` folders. It also creates appropriate code prefixes for each type of `data` or `meta.var` sheet. Note that **"D"** is reserved for **`data`** files, **"R"** for **`ref`** files and **"N"** for **`n`** files.

```{r ensure-fcodes}
fcodes <- ensure_fcodes(meta.vars)

```

```{r print-fcodes, purl=FALSE}
print(fcodes)
```


## set file.names vector

Specify the `file.names` of the files you wish to process. If you want to process all files in the file.system use `file.names = NULL`. 

```{r set-file.names}
file.names <- create_file.names(file.names = c("brainmain2.csv", 
                                  "Amniote_Database_Aug_2015.csv", "anagedatasetf.csv"))


```

```{r print-file.names, purl=FALSE}
print(file.names)
```


## load system reference files

Load the system reference (sys.ref) files required for data processing.
```{r load-sys.ref}
load_sys.ref(fileEncoding = "mac", view = F)
```
*use `view = T` to open a viewer for each of the sys.ref files on load.

### `metadata.csv`

```{r table-metadata, purl=FALSE}

kable(head(metadata, 10))
```


### `data_log.csv`

```{r table-data_log, purl=FALSE}


kable(data_log)
```

### `vnames.csv`

```{r table-vnames, purl=FALSE}

kable(head(vnames, 10))
```

## load `syn.links.csv`

Used for taxonomic matching (plans to automate this by integrating package `taxize`. supplied syn.links only relates to birds).

```{r load-syn.links}
syn.links <- read.csv(text=getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/data/input/taxo/syn.links.csv", 
                                  ssl.verifypeer = FALSE), header=T)
```

<br>

***

# process file.system

This step processes the `.csv` copies of the raw data in the `pre/` folder and writes the processed files as `.csv` to the `post/` folder, ready to be matched. The processing conserves the file.names throughout. It is important for this step that the file.system is correctly populated (ie. `meta.var` data sheets should be named with the same name as the `data` data sheet and in the correct folder). 



<br>

**The function runs a basic processing stage for each file in `file.names`:**

- processing is iterated over:
    - **all file.names** available in the file.system if `file.system = "fromFS"`,
    - **files specified in `file.names`** if `file.names` is vector of file names (note that only files available in the file.system are processed).
- data are loaded, trimmed of whitespace, blank lines and `c("", " ", "NA", "-999")` coded as `NA`s by default.
- column names is the data are matched to master variable codes through `vnames`
- if the original dataset contains species details across two columns (ie *species* and *genus*), data in the columns are concatinated in the form `"genus_species"` and merged into a single `species` column. ***ensure column in files containing genus data is matched to code `genus` in `vnames`.*** 

<br>

**custom processing scripts**

you can include extra processing scripts for individual files by adding them into the `{script.folder}process/` folder. To be loaded correctly, **scripts need to be named appropriately:**

- to source across *all files* in file.system matching the `file.name`: name script as `file.name`. eg .`"Amniote_Database_Aug_2015.R"`
- to source for *files in a specific folder* in file.system matching the `file.name`: name script as `file.name` appended with appropriate `fcode`, eg .`"Amniote_Database_Aug_2015_ref.R"`


```{r process-csvs, warning=FALSE}

process_file.system(file.names, fcodes)

```

<br>

***

# Create database

## create spp.list

Use `spp.list_source` to specify dcode of file.name to extract spp.list from. Otherwise, 
supply vector of species names to `species`.

```{r create-spp.list}
spp.list <- createSpp.list(species = NULL, 
                           taxo.dat = NULL, 
                           spp.list_src = spp.list_src)

```

```{r print-str-spp.list, purl=FALSE}
str(spp.list, vec.len = 3)
```


## create master shell

```{r create-master}
master <- create_master(spp.list)
```

```{r print-str-master, purl=FALSE}
str(master, max.level = 2, vec.len = 3)
```

<br>

***

# match new datasets to master

## create match object from file.name. 

The fuction loads the file specified by `filename` in `input.folder/pre/csv/`. The argument `sub` specifies which of the two sets of species to be matched (`spp.list` or `data`) is a subset (ie smaller) than the other. `spp.list` is the spp.list attached to the master.
```{r create-m}

filename <- file.names[file.names == "Amniote_Database_Aug_2015.csv"]

m <- matchObj(file.name = filename,
              spp.list = master$spp.list,
              sub = "spp.list") # use addMeta function to manually add metadata.

```

```{r print-prematch-m, echo=FALSE, purl=FALSE}
str(m, max.level = 1, vec.len = 3)
```

## compile metadata and prepare m for matching

```{r process-m}
m <- m %>% 
  separateDatMeta() %>% 
  compileMeta(input.folder = input.folder) %>%
  checkVarMeta(master$metadata) %>%
  dataMatchPrep()
```


```{r print-prematch1-m, echo=FALSE, purl=F}
str(m, max.level = 1, vec.len = 3)
```

## match m to spp.list

```{r data-spp-match}
m <- dataSppMatch(m, syn.links = syn.links, addSpp = T)
```


```{r print-postmatch-m, echo=FALSE, purl=F}
str(m, max.level = 1, vec.len = 3)
```

### 

```{r output}
output <- masterDataFormat(m, meta.vars, match.vars, var.vars)
```

```{r print-output, purl=F}
str(output, max.level = 1, vec.len = 3)
```


### merge to master

```{r merge-to-master}
master <- updateMaster(master, output = output)
```

```{r print-master, purl=F}
str(master, max.level = 1, vec.len = 3)
```


***
<br>

# interactive view of framework objects:

## master
```{r print-ia-master, purl=F}
jsonedit(master)
```

<br>

## matched m
```{r print-ia-m, purl=F}
jsonedit(m)
```

